Module 1, Lecture 1: Bioinformatics

M Hallett
January 2015

COMP-364 Tools for the Life Sciences

Logistics

www.bci.mcgill.ca My lab website
www.bci.mcgill.ca/home/?page_id=811 Course Website

TRF 10:35am-11:25am
ENGTR – Trottier Building 2110
Jan 5th 2015 – April 14th 2015

michael.t.hallett@mcgill.ca
Office: Bellini 434
Office Hours: TBA

Daniel Del Balso, Teaching Assistant
daniel.delbalso2@mail.mcgill.ca
Office: Bellini 432
Office Hours: TBA

Course Evaluation & Schedule

Exercise | Due Date | % of Grade
Assignment 0 | Friday, January 16th, 2015 | 10%
Assignment 1 | Tuesday, February 3rd, 2015 | 10%
Assignment 2 | Tuesday, February 17th, 2015 | 10%
Midterm | February 27th, 2015 | 20%
Assignment 3 | March 10th, 2015 | 10%
Assignment 4 | March 24th, 2015 | 10%
Final Exam | TBA | 30%

Course Infrastructure

  • If you are registered for this course, you will have a SOCS (School of Comp Sci) account.
  • The SOCS account provides you access to the SOCS workstations and server.
  • We (Daniel and I) have installed the software and datasets for this course on the SOCS infrastructure.
  • There is a focus on the basic biology, statistics and computation related to breast cancer.
  • This requires that you understand or learn how to program, that you learn some statistics and that you understand the nature of systems biology or -omic data.
  • If you have never programmed before, expect this to be a programming-heavy course.
  • We are mostly going to use data made available as part of the Breast Cancer TCGA dataset.
  • You will also need to become familiar with RStudio, a programming environment to deal with R and datasets.
  • You will also have to become somewhat familiar with a version control system called Git.
  • We will make all the slides, data, code and assignments available via a Git repository.

What is Bioinformatics?

  • The science of biological information
  • Managing biological information is a part of bioinformatics but not all
  • Examples:

Segue: The Importance of Being a Bioinformatician

  • Bioinformatics resources such as PubMed, GenBank, dbGaP and other tools facilitate the study of specific genes and gene products by life scientists.
  • Consider what life science research looked like before circa 1990 (25 years ago).
  • At that time, the vast majority of basic life sci researchers studied a single gene (or gene product), or at most a single complex (e.g. ribosome).
  • In '90, how did an ESR (Estrogen Receptor) researcher “track” new results?
  • Internet popularized in ~'92. PubMed released in '96. GenBank '88. dbGaP '07.
  • A lot of actual walking to a library and looking up keywords at the back of a journal that seemed likely to publish results about ESR.
  • It was the dark ages.

Segue (2): And the Geeks shall inherit the earth

  • Then came the internet in '92. General interconnectivity. Within 5 years, every major journal was publishing their papers on-line.
  • Then came PubMed.
  • Now virtually every published medical/life science paper was available instantly.
  • The ability to search for keywords (e.g. ESR, estrogen) within English text allowed single genes/gene products to be followed closely.
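
As a rough illustration of the keyword tracking described above, here is a minimal Python sketch. The abstracts, their ids and the helper find_papers are all hypothetical and invented for this example; it does not use or imitate the PubMed interface, it only shows the idea of matching gene-related keywords in English text.

```python
import re

# Hypothetical abstracts, invented purely for illustration.
ABSTRACTS = {
    "paper_1": "Estrogen receptor status predicts response to endocrine therapy.",
    "paper_2": "A survey of ribosome assembly factors in yeast.",
    "paper_3": "ESR1 mutations emerge in metastatic breast tumours.",
}


def find_papers(abstracts, keywords):
    """Return sorted ids of abstracts containing a word that starts with any keyword.

    Matching is case-insensitive and anchored at the start of a word,
    so the keyword "ESR" also matches "ESR1".
    """
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + ")", re.IGNORECASE)
    return sorted(pid for pid, text in abstracts.items() if pattern.search(text))


hits = find_papers(ABSTRACTS, ["ESR", "estrogen"])
print(hits)
```

Matching at word boundaries is a deliberate simplification: it avoids hits inside unrelated words, but a real search would also need gene synonyms and aliases.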

Segue (3): And the Geeks shall inherit the earth

PubMed.home

Segue (4): And the Geeks shall inherit the earth

PubMed.search

Nice Tool: PubMed Automated Searches

  • Daily email updates for a keyword search. PubMed.autosearch

Segue (5): And the Geeks shall inherit the earth

  • So PubMed allowed researchers to identify papers that mentioned a gene (e.g. ESR).

  • (A lot of Principal Investigators (PIs) still rely primarily, or even exclusively, on PubMed to track their genes.)

  • But what about results related to ESR that are derived from -omic/systems biology efforts?

  • E.g. every time an individual is sequenced, their ESR gene is sequenced, and any mutations in this gene add to the global pool of polymorphisms.

  • E.g. every time a higher-order eukaryote is sequenced, a homologue of ESR is sequenced (and may not be named ESR).

  • E.g. every time a gene expression microarray is performed on a human sample, ESR levels are of course measured, since these microarrays cover the complete transcriptome.

  • E.g. every time a mass spectrometry experiment is performed to identify proteins or protein interactions, ESR will have been measured too.

Segue (6): And the Geeks shall inherit the earth

  • GenBank '88, dbGaP '07 and many other databases provide all of this information.

  • Can researchers afford to ignore this information and only look at the primary research?

  • Bioinformatics software became necessary to perform the complicated statistical searches that allow researchers to track their genes in these datasets.

Oncomine

What is Bioinformatics? (2)

  • The science of biological information
  • Managing biological information is a part of bioinformatics but not all

  • Bioinformatics is also the investigation of biological systems using tools from information science.

  • Often this is about hypothesis testing and biomarker discovery.

  • For example, the development of gene panels like Oncotype DX (www.oncotypedx.com).

  • For example, my lab considers itself to be a breast cancer research lab whose primary assay is bioinformatics (as opposed to pull downs, PCR, microarray or other assays).

What is Bioinformatics? (3)

  • And often this is about model building (examples in the next slides).
  • For example, models of the genome, exome, transcriptome, proteome, protein interactome, methylome, epigenome, … and many other -omic entities.

  • Hypothesis testing, biomarkers, and model building all require a tremendous amount of tools from biostatistics and computation.

The Human Genome Project

Human Genome in Nature

Human Genome Project

Human Genome Project

Then the 1,000 and 10,000 Genome Projects...

1K Genomes

10K Genomes

Catalogues of "functional genomic" information

HapMap: The HapMap project catalogs single nucleotide polymorphisms and other mutations in human populations.

Transcriptome: Gene Expression Omnibus and other efforts seek to catalogue transcriptional activity (mRNA expression levels).

Catalogues of "post-genomic" information

Epigenome: The Epigenome project is attempting to catalogue all epigenetic modifications (e.g. methylation) in different types of human cells (e.g. neuronal vs epithelial vs fibroblasts vs endothelial etc.).

PPI Networks: Networks that capture which pairs of proteins interact within a cell or organism. The example shown is for a bacterium (Treponema pallidum). Nodes are proteins and edges (lines) connect two proteins that have been determined to interact. Interactions can be between proteins within a complex (e.g. proteins that comprise the ribosome), proteins that phosphorylate other proteins within signalling cascades, protein chaperones that help other proteins fold, or …
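
The node-and-edge description above maps directly onto a graph data structure. The following Python sketch is one possible representation (an adjacency map of sets), under the assumption that interactions are undirected. The protein names and interaction pairs are placeholders, not real data, and build_network and degree are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical interaction pairs; the protein names are placeholders, not real data.
INTERACTIONS = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]


def build_network(pairs):
    """Build an undirected network: each protein maps to the set of its partners."""
    network = defaultdict(set)
    for p, q in pairs:
        network[p].add(q)  # an edge connects p to q ...
        network[q].add(p)  # ... and, being undirected, q to p
    return network


def degree(network, protein):
    """Number of distinct interaction partners; 0 for a protein not in the network."""
    return len(network.get(protein, ()))


ppi = build_network(INTERACTIONS)
print(sorted(ppi["B"]))
print(degree(ppi, "D"))
```

Sets are used for the partner lists so that an interaction reported twice is stored only once.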

Catalogues versus Models

  • The above examples weren't really bioinformatics specific, but rather genomics, proteomics, systems biology or other -omic challenges.
  • These are in large part about technology (to sequence in a massively parallel fashion, mass spectrometry for proteomics/metabolomics, NMR/crystallography, …)
  • However, they do use the cataloging/organizational aspects that bioinformatics offers.
  • In addition to simply collecting and organizing information, the main aim of bioinformatics is to model biological processes…
  • … often using the information provided by these -omic projects.
  • Bioinformatics is a predictive science: can we build a model that accurately predicts how a biological system (cell line, tissue culture, mouse model, bacteria, or human system) will behave?

Consider, the Genetic Code

Historically, biological models are simple and deterministic

Genetic Code Wheel

Genetic Code
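
To make "simple and deterministic" concrete: the standard genetic code is just a fixed lookup table from 64 codons to amino acids (or stop). The Python sketch below builds that table and translates a DNA coding sequence. It assumes the standard code, reads from the first base in frame, and stops at the first stop codon. The name translate and the example sequence are hypothetical choices for this illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reads codons in frame from the first base, stops at the first stop
    codon, and ignores any trailing bases that do not form a full codon.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

The same input always yields the same output, which is the sense in which this model is deterministic. Sequences containing characters other than A, C, G and T raise a KeyError in this sketch.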

The Double Helix Model of DNA

Genetic Code Wheel

Genetic Code

There appear to be very few such simple examples


COMP-364 (c) M Hallett, BCI-McGill